refactor(tests): split suites by execution level and speed up CI #1609
Conversation
@claude review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: eb98e3ad52
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
💡 Codex Review
Reviewed commits: 6fed925e43, 170beb8201, daaf571e78
@claude review
@codex review
Codex Review: Didn't find any major issues. Hooray!
💡 Codex Review
Reviewed commit: 7c38f2a197
```python
item.add_marker(pytest.mark.e2e)
item.add_marker(pytest.mark.live_provider)
```

```python
@pytest.fixture(autouse=True)
def reset_langfuse_state(request: pytest.FixtureRequest) -> Iterable[None]:
    if request.node.get_closest_marker("live_provider") is not None:
        yield
        return

    LangfuseResourceManager.reset()
    yield
    LangfuseResourceManager.reset()
```
🔴 The new autouse reset_langfuse_state fixture (tests/conftest.py:65-77) calls LangfuseResourceManager.reset() before and after every non-live_provider test, which shuts down ALL registered resource managers globally — including the one backing the module-level langfuse = Langfuse() instance created at import time in tests/live_provider/test_openai.py:11. When a mixed pytest session runs tests/unit before tests/live_provider, every unit test invalidates that module-level instance; subsequent live_provider calls to langfuse.flush() operate on a shut-down manager, silently dropping all spans and causing live_provider assertions to fail. Fix by limiting the reset fixture's scope to tests/unit only (e.g. skip when the test is not marked unit), or by replacing the module-level client in test_openai.py with a per-test fixture.
Extended reasoning...
What the bug is and how it manifests
The reset_langfuse_state fixture introduced by this PR is declared autouse=True and applies to every test in the session. It correctly short-circuits for tests carrying the live_provider marker (lines 66-68), but it still fires — both before and after — for every unit and e2e test. Each invocation calls LangfuseResourceManager.reset(), which iterates over every key in _instances, calls shutdown() on each resource manager, then clears _instances entirely.
The specific code path that triggers it
- pytest collection imports `tests/live_provider/test_openai.py` at module scope, executing line 11: `langfuse = Langfuse()`. This creates a `LangfuseResourceManager` keyed to the environment's public key and registers it in `_instances`.
- The first unit test runs. `reset_langfuse_state` fires its setup phase, calling `LangfuseResourceManager.reset()`. This calls `shutdown()` on the module-level manager (which calls `flush()`, then stops worker threads), then removes it from `_instances`.
- The `langfuse` module-level variable in `test_openai.py` still holds a reference to the now-shut-down `Langfuse` client. The client's `_resources` attribute points to the dead `LangfuseResourceManager` — there is no lazy re-initialization path (confirmed by langfuse/_client/client.py line 325).
- When any live_provider test calls `langfuse.flush()` (which happens dozens of times throughout test_openai.py), it calls `force_flush()` on the old, already-shut-down tracer provider. Spans recorded by OpenAI instrumentation after the reset go to a newly-created resource manager (created implicitly on the next `Langfuse()` call from OpenAI instrumentation), but that manager is never explicitly flushed by `langfuse.flush()`.
Why existing safeguards do not prevent it
The fixture correctly guards live_provider tests from resetting each other. However, the guard fires on the test being run, not on all tests in the session. A unit test running before a live_provider test has already fired the global reset, and no mechanism can undo a shutdown after the fact. The Langfuse client does not attempt to reconnect or re-acquire a new resource manager when its stored _resources instance has been shut down.
Impact
Running pytest tests/unit tests/live_provider locally (a natural workflow for a developer iterating on both suites) silently breaks every live_provider test that uses the module-level langfuse object. Assertions like assert len(generation.data) != 0 will fail because spans are never exported. In CI the suites run in separate jobs and separate processes, so the regression is invisible there — making it an insidious local-only failure that can block developer productivity and erode confidence in the test suite.
Step-by-step proof
pytest tests/unit tests/live_provider— collection imports test_openai.py; module-levelLangfuse()registers manager M0 in_instances.- First unit test starts;
reset_langfuse_state(setup) callsLangfuseResourceManager.reset(): M0 is shut down and_instancesis cleared. - Unit test body runs; unit test teardown calls
LangfuseResourceManager.reset()again (no-op,_instancesalready empty). - live_provider test
test_openai_chat_completionruns; it callsopenai.chat.completions.create(...)which records spans via a new resource manager M1 (created implicitly by the OpenAI instrumentation's internalget_client()). - Test calls
langfuse.flush()(the module-level instance) →M0.flush()→M0.tracer_provider.force_flush()on the dead provider → no spans from M1 are exported. get_api().legacy.observations_v1.get_many()returns empty;assert len(generation.data) != 0fails.
How to fix it
Restrict `reset_langfuse_state` to unit tests only — the only suite that genuinely needs a clean slate between tests:

```python
@pytest.fixture(autouse=True)
def reset_langfuse_state(request: pytest.FixtureRequest) -> Iterable[None]:
    if request.node.get_closest_marker('unit') is None:
        yield
        return

    LangfuseResourceManager.reset()
    yield
    LangfuseResourceManager.reset()
```

Alternatively, move the module-level `langfuse = Langfuse()` in test_openai.py into a module-scoped or function-scoped fixture so it is always initialized after any reset.
```python
with pytest.raises(NotFoundError):
    langfuse.api.dataset_items.get(item.id)
```
🟡 The final NotFoundError assertion in test_upsert_and_get_dataset_item (lines 151-152) calls langfuse.api.dataset_items.get(item.id) directly with no retry, immediately after a wait_for_result that only guarantees the item has disappeared from the /datasets/{name}/items list endpoint — a different server code path than /dataset-items/{id}. If the server propagates the ARCHIVED status to these two endpoints at slightly different times, the direct get() may not yet raise NotFoundError, causing a spurious test failure. Wrapping the assertion in a retry predicate that loops until NotFoundError is raised would make this consistent with every other assertion in the same test.
Extended reasoning...
What the bug is and how it manifests
The test test_upsert_and_get_dataset_item (tests/e2e/test_datasets.py) archives a dataset item and then uses two sequential checks to confirm the archival has taken effect. First, it calls wait_for_result (lines 142-147) polling langfuse.get_dataset(name) until the item no longer appears in the dataset's item list. Then, immediately afterward (lines 151-152), it calls langfuse.api.dataset_items.get(item.id) with a bare pytest.raises(NotFoundError) and no retry.
The specific code path that triggers it
langfuse.get_dataset(name) ultimately queries the server's GET /datasets/{name}/items list endpoint, which filters results by status = ACTIVE. langfuse.api.dataset_items.get(item.id) queries the server's GET /dataset-items/{id} direct-get endpoint. Although both endpoints read from the same backing database, these are distinct server code paths with their own query logic. In a deployment using async queue-based ingestion (LANGFUSE_INGESTION_QUEUE_DELAY_MS=10 is set in CI), it is possible — if uncommon — for the ARCHIVED status to propagate to the list-filter query before it propagates to the direct-get 404 guard, causing the direct get to return the item rather than raising NotFoundError.
Why existing code does not prevent it
The wait_for_result on lines 142-147 is explicitly scoped to the list endpoint. Once it returns, all we know is that get_dataset() no longer returns the item. There is no guarantee that the direct-get endpoint has also begun returning 404 for the same item ID. The pytest.raises block on lines 151-152 does not retry; it makes a single synchronous call, so any window of inconsistency between the two endpoints results in a test failure.
Addressing the refutation
The refuting verifier argues that a single database write marks the item ARCHIVED and both endpoints share the same backing data, so the scenario requires separate caches — which is speculative. This is a valid point for a simple synchronous system. However, the system under test uses a queue-based ingestion pipeline (LANGFUSE_INGESTION_QUEUE_DELAY_MS=10) that can introduce processing order dependencies. More importantly, the PR's own author demonstrably understands there is an eventual-consistency window here: every single other read assertion in this test was wrapped in wait_for_result, including earlier calls to the same langfuse.api.dataset_items.get() endpoint (lines 102-105 and 119-127). The omission on lines 151-152 is structurally inconsistent with the pattern the PR establishes, regardless of whether the probability of failure is low.
Impact
The consequence is intermittent CI flakiness — the test passes most of the time but can fail spuriously when the direct-get endpoint has not yet started returning 404 for the archived item. This is test-only code, so there is no production correctness risk.
Step-by-step proof
create_dataset_item(..., status=DatasetStatus.ARCHIVED)is called at line 134.- The ingestion queue processes the archival event asynchronously.
wait_for_result(lines 142-147) pollsGET /datasets/{name}/itemsand returns as soon as that endpoint filters out the item.- At this point the direct-get endpoint
GET /dataset-items/{id}may have not yet received or processed the same status update through its code path. langfuse.api.dataset_items.get(item.id)at line 152 returns the item successfully instead of raisingNotFoundError.pytest.raises(NotFoundError)fails with "did not raise".
Recommended fix
Wrap the final assertion in a retry predicate that loops until NotFoundError is raised, analogous to the inverted pattern needed when the terminal condition is an exception rather than a successful result. This makes the assertion consistent with the rest of the test and robust to endpoint propagation timing.
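One way to express that inverted retry, sketched with stdlib only. `wait_until_raises` and its parameters are hypothetical names, not the actual `tests/support/retry.py` API, and `NotFoundError` is stubbed so the snippet is self-contained:

```python
import time


class NotFoundError(Exception):
    """Stand-in for the SDK's NotFoundError, so the sketch runs standalone."""


def wait_until_raises(fn, exc_type, attempts=20, delay=0.25):
    """Poll fn until it raises exc_type: the terminal condition is an
    exception rather than a successful read. Fails if fn keeps succeeding."""
    for _ in range(attempts):
        try:
            fn()
        except exc_type:
            return  # the expected error finally surfaced
        time.sleep(delay)
    raise AssertionError(
        f"{exc_type.__name__} was never raised after {attempts} attempts"
    )
```

The final assertion would then become something like `wait_until_raises(lambda: langfuse.api.dataset_items.get(item.id), NotFoundError)`, making it robust to the propagation window between the two endpoints.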
Summary
This PR restructures the test suite around execution level and follows through on the CI and reliability work needed to make that split useful in practice.
- `tests/unit`
- `tests/e2e`
- `tests/live_provider`
- `tests/support`
- `AGENTS.md` and remove `CLAUDE.md`

CI and test infrastructure
This also updates CI and test plumbing so the new layout stays fast and reliable.
- `tests/unit` on the Python 3.10-3.14 matrix
- `tests/e2e` in 2 mechanical shards selected by `scripts/select_e2e_shard.py`
- `tests/live_provider` as one dedicated suite that always runs
- `serial_e2e` as the scheduling-specific pytest marker
- `e2e_core`/`e2e_data` marker routing with mechanical file sharding
- `cancel-in-progress: true`
- `LANGFUSE_INIT_*`

Unit test speedups
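The mechanical shard selection can be approximated by a greedy bin-packing pass over per-file weights (e.g. historical runtimes or test counts). `assign_shards` and its signature are illustrative, not the actual `scripts/select_e2e_shard.py` interface:

```python
def assign_shards(test_weights, num_shards=2):
    """Greedy bin packing: visit files from heaviest to lightest and
    always place the next file on the currently lightest shard."""
    shards = [[] for _ in range(num_shards)]
    totals = [0.0] * num_shards
    # Sort by descending weight, then name, so assignment is deterministic.
    for name, weight in sorted(test_weights.items(), key=lambda kv: (-kv[1], kv[0])):
        lightest = totals.index(min(totals))
        shards[lightest].append(name)
        totals[lightest] += weight
    return shards, totals
```

Greedy assignment does not produce a perfectly balanced split, but for a handful of test files it stays close to optimal and, crucially, is stable across runs, so each CI shard always sees the same files.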
The unit-suite speed work keeps coverage intact rather than weakening assertions.
- `--dist worksteal` for the unit lane

Why
The repo already had a mix of local-only, real-server, and live-provider tests, but the old flat `tests/` layout made the boundary hard to see and easy to erode. This change makes the split explicit in the filesystem, keeps unit coverage fast, and keeps the expensive e2e and provider surface intentional.

Results
Local measurements after the unit refactor:
- `tests/unit/test_otel.py` + `tests/unit/test_propagate_attributes.py`: 215.08s -> 2.38s
- `tests/unit` run: 360 passed, 2 skipped in 13.42s
- `-n auto --dist worksteal`: 360 passed, 2 skipped in 5.42s

Validation
- `uv run --frozen pytest -q tests/unit --maxfail=1`
- `uv run --frozen pytest -q -n auto --dist worksteal tests/unit --maxfail=1`
- `uv run --frozen pytest --collect-only -q tests/unit tests/e2e tests/live_provider`
- `uv run --frozen pytest -q tests/unit/test_e2e_sharding.py`
- `uv run --frozen pytest --collect-only -q tests/live_provider -m 'live_provider'`
- `uv run --frozen ruff check scripts/select_e2e_shard.py tests/conftest.py tests/unit/test_e2e_sharding.py tests/e2e/test_core_sdk.py`

CI is the authoritative validation for the real Langfuse-server and live-provider suites.
Disclaimer: Experimental PR review
Greptile Summary
This PR is a large, well-engineered refactoring that splits the flat `tests/` directory into `tests/unit`, `tests/e2e`, `tests/live_provider`, and `tests/support`, and rewires CI to match. It also meaningfully speeds up the unit suite (360 tests in ~5 s) by replacing polling-based waits with sentinel-driven shutdown and deterministic queue draining, and adds proper thread-safety locks to `PromptCache`. The structural changes are clean and the new retry/sharding helpers are solid.

Key changes:
- `langfuse/_utils/prompt_cache.py`: `PromptCache` and `PromptCacheTaskManager` are now fully protected by `RLock`; the consumer switches from a 1-second polling loop to a blocking `queue.get()` with sentinel-based shutdown; `add_refresh_prompt_task_if_current()` avoids redundant background refreshes.
- `langfuse/_task_manager/media_upload_consumer.py` / `score_ingestion_consumer.py` / `media_manager.py`: sentinel-driven `pause()`/`signal_shutdown()` for immediate worker wake-up on shutdown.
- `scripts/select_e2e_shard.py`: greedy weight-based shard assignment that falls back to local AST-based test counting for unknown files.
- `tests/support/retry.py`: bounded retry helper for e2e eventual-consistency reads.
- `.github/workflows/ci.yml`: `unit-tests` job runs across Python 3.10–3.14; `e2e-tests` job runs two mechanical shards + live-provider suite; the Langfuse server is bootstrapped via `LANGFUSE_INIT_*` env vars instead of a manual seeder script.

Confidence Score: 5/5
Safe to merge — all findings are P2 style/improvement items with no production correctness risk under default configuration
The core source changes (sentinel-based shutdown, RLock thread-safety, prompt cache freshness guard) are sound and well-tested. The multi-consumer sentinel issue only affects threads > 1 which is not the default. The retry edge case requires a specific timing race that is unlikely in practice. All remaining findings are style, robustness improvements, or minor logic nuances — none block correctness on the happy path.
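The sentinel-driven shutdown pattern mentioned above can be sketched with a generic consumer thread; this is an illustrative simplification, not the SDK's actual consumer classes. The worker blocks on `queue.get()` instead of polling with a timeout, and a unique sentinel object wakes it immediately on shutdown:

```python
import queue
import threading

_SENTINEL = object()  # unique marker no real work item can equal


class Consumer(threading.Thread):
    """Blocks on queue.get() instead of polling; the sentinel wakes the
    worker instantly on shutdown (sketch of the pattern, not the SDK code)."""

    def __init__(self, q):
        super().__init__(daemon=True)
        self._queue = q
        self.processed = []

    def run(self):
        while True:
            item = self._queue.get()
            if item is _SENTINEL:
                break  # exit immediately, no timeout-based polling loop
            self.processed.append(item)

    def signal_shutdown(self):
        # FIFO ordering guarantees all items enqueued before the
        # sentinel are drained before the worker exits.
        self._queue.put(_SENTINEL)
```

Because the sentinel travels through the same FIFO queue as real work, shutdown is both immediate (no sleep interval to wait out) and deterministic (everything enqueued earlier is drained first).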
`tests/support/retry.py` (stale-error edge case), `langfuse/_utils/prompt_cache.py` (multi-consumer shutdown), `scripts/select_e2e_shard.py` (class-based test counting)
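The `RLock` protection added to the prompt cache can be sketched as follows; the class below is a hypothetical simplification of `langfuse/_utils/prompt_cache.py` (names and TTL semantics assumed), showing only the locking pattern:

```python
import threading
import time


class PromptCache:
    """Minimal sketch of an RLock-guarded cache with TTL-based staleness
    (illustrative only, not the SDK's actual PromptCache)."""

    def __init__(self, ttl_seconds=60):
        # RLock (not Lock) so a method holding the lock can call
        # another locked method on the same instance without deadlocking.
        self._lock = threading.RLock()
        self._items = {}  # key -> (value, stored_at)
        self._ttl = ttl_seconds

    def set(self, key, value):
        with self._lock:
            self._items[key] = (value, time.monotonic())

    def get(self, key):
        with self._lock:
            entry = self._items.get(key)
            if entry is None:
                return None
            value, stored_at = entry
            if time.monotonic() - stored_at > self._ttl:
                return None  # stale: the caller should trigger a refresh
            return value
```

Every read and write takes the same lock, so concurrent test workers (or background refresh tasks) never observe a half-updated entry.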
Important Files Changed
Flowchart
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    PR[Pull Request / Push] --> C{Concurrency group}
    C -->|cancel-in-progress| LINT[linting job]
    C --> TC[type-checking job]
    C --> UNIT[unit-tests\nPython 3.10–3.14 matrix]
    C --> E2E[e2e-tests matrix]
    E2E --> S1[Shard 1\nselect_e2e_shard --shard-index 0]
    E2E --> S2[Shard 2\nselect_e2e_shard --shard-index 1]
    E2E --> LP[live_provider suite]
    S1 --> PAR1[pytest -n 4 --dist worksteal\nnot serial_e2e]
    S1 --> SER1[pytest serial_e2e]
    S2 --> PAR2[pytest -n 4 --dist worksteal\nnot serial_e2e]
    S2 --> SER2[pytest serial_e2e]
    LP --> LPT[pytest -n 4 --dist worksteal\ntests/live_provider -m live_provider]
    UNIT -->|pytest -n auto --dist worksteal| UT[tests/unit 360 tests]
    LINT --> GATE[all-tests-passed]
    TC --> GATE
    UNIT --> GATE
    S1 --> GATE
    S2 --> GATE
    LP --> GATE
    GATE --> MERGE[Branch protection check]
```

Reviews (1): Last reviewed commit: "fix(prompt-cache): avoid redundant refre..."
Context used: Learnt from langfuse/langfuse-python#1387